B&B with Local Heaps #1149
Remove dependency on rmm::mr::device_memory_resource base class. Resources now satisfy the cuda::mr::resource concept directly.
- Replace shared_ptr<device_memory_resource> with value types and cuda::mr::any_resource<cuda::mr::device_accessible> for type-erased storage
- Replace set_current_device_resource(ptr) with set_current_device_resource_ref
- Replace set_per_device_resource(id, ptr) with set_per_device_resource_ref
- Remove make_owning_wrapper usage
- Remove dynamic_cast on memory resources (no common base class)
- Remove owning_wrapper.hpp and device_memory_resource.hpp includes
- Add missing thrust/iterator/transform_output_iterator.h include (no longer transitively included via CCCL)
…nd deterministic mode. Signed-off-by: Nicolas Guidotti <224634272+nguidotti@users.noreply.github.com>
Signed-off-by: Nicolas Guidotti <nguidotti@nvidia.com>
… shared_ptr to avoid unnecessary copy. Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
…l crash in work-stealing Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
…queue for now. refactoring. Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
… are present Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
# Conflicts: # cpp/src/utilities/cuda_helpers.cuh
Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
# Conflicts: # ci/validate_wheel.sh
Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
/ok to test 37e757a
Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
/ok to test b2e5f8c
Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
chris-maes left a comment
Thanks for the nice discussion @nguidotti and congratulations on the nice performance improvement with this PR. I'm removing the Request changes so that you can merge when ready.
A couple suggestions:
- It sounds like the only reason `node_queue_t` exposes `lock` and `unlock` is to allow a worker to steal a node from another. It would probably be better to add a method like `node_queue_t::steal_from_victim(node_queue_t& victim)` and handle the locking and unlocking of both queues directly in this method. That would allow you to not expose `node_queue_t::lock/unlock`, so that people touching the branch and bound code would not need to be concerned about correctly locking and unlocking the node queue. Inside `steal_from_victim` you can avoid deadlocks by acquiring the locks on the thief and the victim in sorted order according to their worker id (so this might need to be `node_queue_t::steal_from_victim(i_t thief_id, i_t victim_id, node_queue_t& victim)`).
Ideally, stealing a node is an atomic operation, so that the node is always either in one queue or another, and thus the node's lower bound is always considered. If you are able to make it an atomic operation you can avoid the need to track the lower bound associated with the node separately (which may be prone to bugs).
Also, if diving needs to copy a node from the node queue, and that cannot happen while stealing, you can add a method `node_queue_t::copy_node` that acquires `mutex_` internally.
Maybe you are able to make the above changes before merging.
- Longer term, I think it's worth defining the correct abstractions and data structures to make managing the lower bound simpler. A heap is already the ideal data structure for managing the lower bound, since it inherently takes the lower bound over the nodes it contains. I think we've introduced a lot of book-keeping and other data structures to manage nodes and lower bounds outside the heap. This is likely because we don't have a way to walk nodes in the heap (i.e. the standard C++ data structures only support pushing and popping). If we had a heap where we could walk nodes, I think it would simplify many of the operations within the branch and bound code. Instead of popping a node off the heap and tracking its lower bound separately while solving, we could leave it on the heap and just mark that node as "solve in progress". We would only pop a node from the heap when the solve was completed. When trying to steal a node from a heap or dive from a node, thieves could avoid "solve in progress" nodes. Also, during a plunge we could push child nodes that we are not exploring directly onto the heap, instead of keeping them in a separate stack or circular buffer data structure.
If the code maintained the invariant that all open nodes are in a heap, I think it would be much easier to reason about the correctness of branch and bound.
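The `steal_from_victim` suggestion above could look roughly like the following. This is only a sketch under assumptions, not the code in this PR: `node_t`, the comparator, the `std::priority_queue` storage, and the choice to steal the victim's best node are all hypothetical, and the worker id is stored in the queue instead of being passed as `thief_id`/`victim_id`. Both mutexes are taken in ascending worker-id order, and they are held for the whole move so the node is always in exactly one heap.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <optional>
#include <queue>
#include <vector>

// Hypothetical stand-in for a branch-and-bound node.
struct node_t {
  double lower_bound;
  int id;
};

// Orders the heap so that the smallest lower bound is on top.
struct node_compare_t {
  bool operator()(const node_t& a, const node_t& b) const
  {
    return a.lower_bound > b.lower_bound;
  }
};

// Sketch of a per-worker queue that hides its locking from callers.
class node_queue_t {
 public:
  explicit node_queue_t(int worker_id) : worker_id_(worker_id) {}

  void push(node_t node)
  {
    std::lock_guard<std::mutex> guard(mutex_);
    heap_.push(node);
  }

  std::optional<node_t> pop()
  {
    std::lock_guard<std::mutex> guard(mutex_);
    if (heap_.empty()) { return std::nullopt; }
    node_t node = heap_.top();
    heap_.pop();
    return node;
  }

  std::size_t size()
  {
    std::lock_guard<std::mutex> guard(mutex_);
    return heap_.size();
  }

  // Moves the victim's top node into this queue. Returns false if nothing
  // was stolen. Assumption: every queue has a distinct worker id.
  bool steal_from_victim(node_queue_t& victim)
  {
    if (&victim == this) { return false; }
    assert(worker_id_ != victim.worker_id_);

    // Lock in ascending worker-id order so two concurrent thieves that
    // target each other cannot deadlock.
    node_queue_t* first  = worker_id_ < victim.worker_id_ ? this : &victim;
    node_queue_t* second = first == this ? &victim : this;
    std::lock_guard<std::mutex> first_guard(first->mutex_);
    std::lock_guard<std::mutex> second_guard(second->mutex_);

    if (victim.heap_.empty()) { return false; }
    heap_.push(victim.heap_.top());
    victim.heap_.pop();
    return true;
  }

 private:
  int worker_id_;
  std::mutex mutex_;
  std::priority_queue<node_t, std::vector<node_t>, node_compare_t> heap_;
};
```

A thief would then call `thief_queue.steal_from_victim(victim_queue)` without touching any mutex itself. In C++17, `std::scoped_lock` over both mutexes is an alternative to the manual ordering, since it applies a deadlock-avoidance algorithm internally.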
…logic for launching new bfs workers and work stealing Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
…ressing the packed buffer Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
/ok to test d094751
Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
/ok to test 6f7ab06
Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>
/ok to test d18c1e9
/merge
In this PR, each best-first worker has its own local node heap, so that it can push/pop nodes without synchronizing with other workers. Each best-first worker periodically steals a node from a random worker to keep the node distribution more or less balanced across them. Additionally, each best-first worker has a (fixed) set of diving workers assigned to it, which are used for diving on its own nodes whenever possible. This essentially eliminates the need for the scheduler thread, freeing one additional thread to do something useful.
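To make the "periodically steals a node from a random worker" part concrete, here is a minimal sketch of how a worker might pick a victim and decide when to steal. The names `pick_victim` and `steal_policy_t`, the uniform choice among the other workers, and the fixed period counted in processed nodes are all assumptions for illustration; the actual policy in this PR may differ.

```cpp
#include <cassert>
#include <random>

// Hypothetical helper: pick a victim uniformly among the other workers.
// Draws from [0, num_workers - 2] and skips over `self`, so the worker
// never selects its own heap.
inline int pick_victim(int self, int num_workers, std::mt19937& rng)
{
  assert(num_workers >= 2 && self >= 0 && self < num_workers);
  std::uniform_int_distribution<int> dist(0, num_workers - 2);
  const int v = dist(rng);
  return v >= self ? v + 1 : v;
}

// Hypothetical policy: attempt one steal every `period` processed nodes.
struct steal_policy_t {
  int period;
  int processed = 0;

  // Call once per processed node; returns true when a steal is due.
  bool should_steal()
  {
    if (++processed < period) { return false; }
    processed = 0;
    return true;
  }
};
```

Each best-first worker would own one `std::mt19937` and one `steal_policy_t`, so neither needs any synchronization.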
This also implements a compression scheme for `vstatus` using only `2` bits per entry, which reduces the memory consumption by roughly `4x` (previously it used `int8_t` per entry). Last, but not least, this PR replaces `std::deque` with a fixed-capacity `circular_deque_t` for the plunge/dive stacks and the idle-worker list.

MIPLIB results (GH200, 10 min):
In summary, we explored `~3x` as many nodes on average in the same time frame. The number of optimal solutions also increased by 3.
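For reference, the 2-bit `vstatus` packing mentioned above can be sketched as follows. This is an illustration under assumptions, not the PR's implementation: the class name `packed_status_t` and its interface are hypothetical, and it only assumes that each status takes one of at most four values (0 to 3). Four entries share one byte, which is where the roughly 4x saving over one `int8_t` per entry comes from.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical packed container: four 2-bit entries per byte.
class packed_status_t {
 public:
  explicit packed_status_t(std::size_t n) : size_(n), data_((n + 3) / 4, 0) {}

  // Reads entry i by shifting its 2-bit field down and masking it.
  std::uint8_t get(std::size_t i) const
  {
    assert(i < size_);
    const unsigned shift = 2u * static_cast<unsigned>(i % 4);
    return static_cast<std::uint8_t>((data_[i / 4] >> shift) & 0x3u);
  }

  // Writes entry i: clear its 2-bit field, then OR in the new value.
  void set(std::size_t i, std::uint8_t value)
  {
    assert(i < size_ && value < 4);
    const unsigned shift   = 2u * static_cast<unsigned>(i % 4);
    const unsigned cleared = data_[i / 4] & ~(0x3u << shift);
    data_[i / 4] = static_cast<std::uint8_t>(cleared | (static_cast<unsigned>(value) << shift));
  }

  std::size_t size() const { return size_; }   // number of entries
  std::size_t bytes() const { return data_.size(); }  // storage in bytes

 private:
  std::size_t size_;
  std::vector<std::uint8_t> data_;
};
```

With this layout, 10 entries occupy 3 bytes instead of 10, and neighbouring entries in the same byte are left untouched by `set`.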